from IPython.display import Image
Image(filename="D:\IBM_Data_Science\Image\Articles-Customer-Segmentation_Eng_A.png",width=700,height=300)
What does customer segmentation mean?
Customer segmentation is the process of dividing a customer base into distinct groups of individuals that have similar characteristics. This process makes it easier to target specific groups of customers with tailored products, services, and marketing strategies. By segmenting customers into different classes, businesses can better understand their needs, preferences, and buying patterns, allowing them to create more personalized and effective marketing campaigns.
What are the 4 types of customer segmentation?
Customer segmentation is the practice of dividing customers into distinct groups with common characteristics. Examples of customer segmentation include geographic segmentation (dividing customers by region), demographic segmentation (dividing customers by age, gender, marital status, etc.), behavioral segmentation (dividing customers by purchase behavior, usage patterns, loyalty, etc.), and psychographic segmentation (dividing customers by attitudes, values, lifestyle, etc.).
In the second part of the project, we will use the clustering technique known as K-Means, a popular unsupervised machine learning algorithm used for cluster analysis. The algorithm groups similar data points into clusters based on their similarity, and it is often used in marketing and customer segmentation The correct implementation of the algorithm can lead to clearer and more successful business decisions All of the modules and library used for this part of the project are listed below
Customer Segmentation with K-Means
In the second part of the project, we will use the clustering technique known as K-Means, a popular unsupervised machine learning algorithm used for cluster analysis. The algorithm groups similar data points into clusters based on their similarity, and it is often used in marketing and customer segmentation.
The correct implementation of the algorithm can lead to clearer and more successful business decisions.
All of the modules and library used for this part of the project are listed below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
import plotly.express as px
from sklearn.model_selection import RandomizedSearchCV
import warnings
warnings.filterwarnings("ignore")
## Import Dataset
df = pd.read_csv("D:\IBM_Data_Science\Data_set\Data_Kagle_Github\clustering.csv")
print(f"Number of clients: {df.shape[0]}\nNumber of features: {df.shape[1]}")
Number of clients: 2237 Number of features: 29
pd.set_option("display.max_columns",29)
df.head()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Total_Purchases | Age | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-09-04 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1617 | 66 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-03-08 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 27 | 69 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 776 | 58 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014-02-10 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 53 | 39 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 422 | 42 |
In order to perform proper customer segmentation, we will use the RMF technique, which will be explained further along the project. For that we need the customers' frequency, which is the number of purchases made in a given period of time.
We don't have that value explicitally, but we can calculate it but summing the number of purhases made by our customers through different channels (Deals, Web, etc) and assign it to df['Frequency'].
df.describe().transpose()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 2237.0 | 5587.024139 | 3245.458565 | 0.0 | 2826.0 | 5454.0 | 8420.0 | 11191.0 |
| Year_Birth | 2237.0 | 1968.866786 | 11.793625 | 1899.0 | 1959.0 | 1970.0 | 1977.0 | 1996.0 |
| Income | 2237.0 | 51973.077747 | 21417.001805 | 1730.0 | 35523.0 | 51717.0 | 68281.0 | 162397.0 |
| Kidhome | 2237.0 | 0.443898 | 0.538421 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| Teenhome | 2237.0 | 0.506482 | 0.544593 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| Recency | 2237.0 | 49.110416 | 28.952136 | 0.0 | 24.0 | 49.0 | 74.0 | 99.0 |
| MntWines | 2237.0 | 304.329012 | 336.651531 | 0.0 | 24.0 | 174.0 | 505.0 | 1493.0 |
| MntFruits | 2237.0 | 26.328565 | 39.793044 | 0.0 | 1.0 | 8.0 | 33.0 | 199.0 |
| MntMeatProducts | 2237.0 | 167.160036 | 225.793687 | 0.0 | 16.0 | 67.0 | 232.0 | 1725.0 |
| MntFishProducts | 2237.0 | 37.565937 | 54.654409 | 0.0 | 3.0 | 12.0 | 50.0 | 259.0 |
| MntSweetProducts | 2237.0 | 27.097005 | 41.297645 | 0.0 | 1.0 | 8.0 | 33.0 | 263.0 |
| MntGoldProds | 2237.0 | 44.063478 | 52.188898 | 0.0 | 9.0 | 24.0 | 56.0 | 362.0 |
| NumDealsPurchases | 2237.0 | 2.325436 | 1.932803 | 0.0 | 1.0 | 2.0 | 3.0 | 15.0 |
| NumWebPurchases | 2237.0 | 4.087617 | 2.779366 | 0.0 | 2.0 | 4.0 | 6.0 | 27.0 |
| NumCatalogPurchases | 2237.0 | 2.664730 | 2.924095 | 0.0 | 0.0 | 2.0 | 4.0 | 28.0 |
| NumStorePurchases | 2237.0 | 5.794814 | 3.250624 | 0.0 | 3.0 | 5.0 | 8.0 | 13.0 |
| NumWebVisitsMonth | 2237.0 | 5.316942 | 2.428060 | 0.0 | 3.0 | 6.0 | 7.0 | 20.0 |
| AcceptedCmp3 | 2237.0 | 0.072865 | 0.259974 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| AcceptedCmp4 | 2237.0 | 0.074654 | 0.262890 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| AcceptedCmp5 | 2237.0 | 0.072865 | 0.259974 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| AcceptedCmp1 | 2237.0 | 0.064372 | 0.245469 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| AcceptedCmp2 | 2237.0 | 0.013411 | 0.115052 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Complain | 2237.0 | 0.008941 | 0.094152 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Response | 2237.0 | 0.149307 | 0.356471 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Total_Purchases | 2237.0 | 606.544032 | 602.307901 | 5.0 | 69.0 | 397.0 | 1047.0 | 2525.0 |
| Age | 2237.0 | 54.101922 | 11.700664 | 27.0 | 46.0 | 53.0 | 64.0 | 83.0 |
Check information In DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2237 entries, 0 to 2236 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 2237 non-null int64 1 Year_Birth 2237 non-null int64 2 Education 2237 non-null object 3 Marital_Status 2237 non-null object 4 Income 2237 non-null float64 5 Kidhome 2237 non-null int64 6 Teenhome 2237 non-null int64 7 Dt_Customer 2237 non-null object 8 Recency 2237 non-null int64 9 MntWines 2237 non-null int64 10 MntFruits 2237 non-null int64 11 MntMeatProducts 2237 non-null int64 12 MntFishProducts 2237 non-null int64 13 MntSweetProducts 2237 non-null int64 14 MntGoldProds 2237 non-null int64 15 NumDealsPurchases 2237 non-null int64 16 NumWebPurchases 2237 non-null int64 17 NumCatalogPurchases 2237 non-null int64 18 NumStorePurchases 2237 non-null int64 19 NumWebVisitsMonth 2237 non-null int64 20 AcceptedCmp3 2237 non-null int64 21 AcceptedCmp4 2237 non-null int64 22 AcceptedCmp5 2237 non-null int64 23 AcceptedCmp1 2237 non-null int64 24 AcceptedCmp2 2237 non-null int64 25 Complain 2237 non-null int64 26 Response 2237 non-null int64 27 Total_Purchases 2237 non-null int64 28 Age 2237 non-null int64 dtypes: float64(1), int64(25), object(3) memory usage: 506.9+ KB
# Count of null value
df.isnull().sum()
ID 0 Year_Birth 0 Education 0 Marital_Status 0 Income 0 Kidhome 0 Teenhome 0 Dt_Customer 0 Recency 0 MntWines 0 MntFruits 0 MntMeatProducts 0 MntFishProducts 0 MntSweetProducts 0 MntGoldProds 0 NumDealsPurchases 0 NumWebPurchases 0 NumCatalogPurchases 0 NumStorePurchases 0 NumWebVisitsMonth 0 AcceptedCmp3 0 AcceptedCmp4 0 AcceptedCmp5 0 AcceptedCmp1 0 AcceptedCmp2 0 Complain 0 Response 0 Total_Purchases 0 Age 0 dtype: int64
# Count row and column
df.shape
(2237, 29)
## Check the interaction between variables
## Covariance matrix
hm=df.corr()
hm
| ID | Year_Birth | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Total_Purchases | Age | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | 1.000000e+00 | 0.006260 | -8.012841e-07 | 0.002160 | -0.002481 | -0.046002 | -0.021528 | 0.005420 | -0.003341 | -0.023630 | -0.006655 | -0.012417 | -0.036954 | -0.017685 | -0.002289 | -0.013269 | -0.007159 | -0.035636 | -0.024965 | -0.007082 | -0.021223 | -0.014892 | 0.031625 | -0.021330 | -0.016604 | -0.002652 |
| Year_Birth | 6.260469e-03 | 1.000000 | -2.011147e-01 | 0.234200 | -0.357774 | -0.018042 | -0.165010 | -0.021361 | -0.035101 | -0.045233 | -0.021592 | -0.065954 | -0.065923 | -0.152577 | -0.127285 | -0.136666 | 0.121202 | 0.061368 | -0.063000 | 0.005793 | -0.007387 | -0.007252 | -0.004163 | 0.019520 | -0.118099 | -0.992095 |
| Income | -8.012841e-07 | -0.201115 | 1.000000e+00 | -0.510538 | 0.033721 | 0.007618 | 0.686263 | 0.505592 | 0.684526 | 0.518786 | 0.518396 | 0.384645 | -0.107538 | 0.450341 | 0.693913 | 0.627274 | -0.646814 | -0.015325 | 0.217420 | 0.395311 | 0.324942 | 0.103900 | -0.027948 | 0.160671 | 0.789544 | 0.198770 |
| Kidhome | 2.159591e-03 | 0.234200 | -5.105382e-01 | 1.000000 | -0.034983 | 0.008129 | -0.496385 | -0.372747 | -0.437180 | -0.387666 | -0.370693 | -0.349750 | 0.221685 | -0.361915 | -0.502529 | -0.499878 | 0.447897 | 0.014840 | -0.161553 | -0.205618 | -0.172309 | -0.081704 | 0.036366 | -0.079835 | -0.556864 | -0.233855 |
| Teenhome | -2.481094e-03 | -0.357774 | 3.372061e-02 | -0.034983 | 1.000000 | 0.016931 | 0.004487 | -0.176969 | -0.261669 | -0.204576 | -0.162838 | -0.021824 | 0.388462 | 0.155632 | -0.110972 | 0.050394 | 0.135255 | -0.042823 | 0.038790 | -0.191288 | -0.140288 | -0.015664 | 0.007591 | -0.154730 | -0.138899 | 0.363105 |
| Recency | -4.600180e-02 | -0.018042 | 7.618002e-03 | 0.008129 | 0.016931 | 1.000000 | 0.016042 | -0.004309 | 0.023074 | 0.001068 | 0.022610 | 0.016423 | -0.000499 | -0.010769 | 0.024986 | 0.000882 | -0.021473 | -0.033036 | 0.018836 | 0.000119 | -0.019313 | -0.001787 | 0.005380 | -0.198679 | 0.020402 | 0.019397 |
| MntWines | -2.152827e-02 | -0.165010 | 6.862631e-01 | -0.496385 | 0.004487 | 0.016042 | 1.000000 | 0.389327 | 0.562324 | 0.399392 | 0.386157 | 0.387116 | 0.010756 | 0.541899 | 0.634960 | 0.641681 | -0.320997 | 0.061909 | 0.373165 | 0.472551 | 0.354024 | 0.205878 | -0.036016 | 0.246920 | 0.891724 | 0.162739 |
| MntFruits | 5.420492e-03 | -0.021361 | 5.055923e-01 | -0.372747 | -0.176969 | -0.004309 | 0.389327 | 1.000000 | 0.542912 | 0.594663 | 0.567000 | 0.392760 | -0.132427 | 0.296363 | 0.487687 | 0.461467 | -0.418645 | 0.014545 | 0.009967 | 0.215697 | 0.194618 | -0.009852 | -0.002933 | 0.125048 | 0.614073 | 0.013618 |
| MntMeatProducts | -3.340869e-03 | -0.035101 | 6.845257e-01 | -0.437180 | -0.261669 | 0.023074 | 0.562324 | 0.542912 | 1.000000 | 0.568188 | 0.523576 | 0.350258 | -0.122682 | 0.293276 | 0.723660 | 0.479189 | -0.539824 | 0.018018 | 0.102687 | 0.373649 | 0.309631 | 0.042939 | -0.020810 | 0.236050 | 0.842860 | 0.030701 |
| MntFishProducts | -2.362977e-02 | -0.045233 | 5.187856e-01 | -0.387666 | -0.204576 | 0.001068 | 0.399392 | 0.594663 | 0.568188 | 1.000000 | 0.579681 | 0.422630 | -0.139575 | 0.293309 | 0.534257 | 0.459514 | -0.446233 | 0.000150 | 0.016637 | 0.199421 | 0.260633 | 0.002491 | -0.018887 | 0.111056 | 0.642633 | 0.042341 |
| MntSweetProducts | -6.655384e-03 | -0.021592 | 5.183958e-01 | -0.370693 | -0.162838 | 0.022610 | 0.386157 | 0.567000 | 0.523576 | 0.579681 | 1.000000 | 0.369414 | -0.120304 | 0.348162 | 0.490641 | 0.448341 | -0.423551 | 0.001299 | 0.028415 | 0.259439 | 0.241674 | 0.009892 | -0.020467 | 0.117070 | 0.602752 | 0.019372 |
| MntGoldProds | -1.241657e-02 | -0.065954 | 3.846450e-01 | -0.349750 | -0.021824 | 0.016423 | 0.387116 | 0.392760 | 0.350258 | 0.422630 | 0.369414 | 1.000000 | 0.048995 | 0.421499 | 0.437378 | 0.381226 | -0.250945 | 0.122906 | 0.022048 | 0.180854 | 0.166238 | 0.049911 | -0.029969 | 0.139569 | 0.523954 | 0.057253 |
| NumDealsPurchases | -3.695428e-02 | -0.065923 | -1.075381e-01 | 0.221685 | 0.388462 | -0.000499 | 0.010756 | -0.132427 | -0.122682 | -0.139575 | -0.120304 | 0.048995 | 1.000000 | 0.233956 | -0.008857 | 0.068576 | 0.347480 | -0.023182 | 0.015537 | -0.183390 | -0.123356 | -0.037736 | 0.003665 | 0.002145 | -0.065397 | 0.068281 |
| NumWebPurchases | -1.768482e-02 | -0.152577 | 4.503415e-01 | -0.361915 | 0.155632 | -0.010769 | 0.541899 | 0.296363 | 0.293276 | 0.293309 | 0.348162 | 0.421499 | 0.233956 | 1.000000 | 0.377923 | 0.502200 | -0.056139 | 0.041914 | 0.155693 | 0.138470 | 0.154954 | 0.034086 | -0.013249 | 0.148390 | 0.519420 | 0.153874 |
| NumCatalogPurchases | -2.288528e-03 | -0.127285 | 6.939128e-01 | -0.502529 | -0.110972 | 0.024986 | 0.634960 | 0.487687 | 0.723660 | 0.534257 | 0.490641 | 0.437378 | -0.008857 | 0.377923 | 1.000000 | 0.518320 | -0.520763 | 0.104513 | 0.139041 | 0.322188 | 0.307972 | 0.099780 | -0.018348 | 0.220526 | 0.778427 | 0.125244 |
| NumStorePurchases | -1.326865e-02 | -0.136666 | 6.272741e-01 | -0.499878 | 0.050394 | 0.000882 | 0.641681 | 0.461467 | 0.479189 | 0.459514 | 0.448341 | 0.381226 | 0.068576 | 0.502200 | 0.518320 | 1.000000 | -0.429084 | -0.068033 | 0.179123 | 0.214568 | 0.183025 | 0.085090 | -0.011539 | 0.038801 | 0.674256 | 0.139230 |
| NumWebVisitsMonth | -7.158723e-03 | 0.121202 | -6.468143e-01 | 0.447897 | 0.135255 | -0.021473 | -0.320997 | -0.418645 | -0.539824 | -0.446233 | -0.423551 | -0.250945 | 0.347480 | -0.056139 | -0.520763 | -0.429084 | 1.000000 | 0.061171 | -0.032180 | -0.278200 | -0.192573 | -0.007217 | 0.020857 | -0.004061 | -0.500722 | -0.117408 |
| AcceptedCmp3 | -3.563572e-02 | 0.061368 | -1.532489e-02 | 0.014840 | -0.042823 | -0.033036 | 0.061909 | 0.014545 | 0.018018 | 0.000150 | 0.001299 | 0.122906 | -0.023182 | 0.041914 | 0.104513 | -0.068033 | 0.061171 | 1.000000 | -0.079627 | 0.080219 | 0.094661 | 0.071981 | 0.009916 | 0.254144 | 0.053071 | -0.061105 |
| AcceptedCmp4 | -2.496497e-02 | -0.063000 | 2.174199e-01 | -0.161553 | 0.038790 | 0.018836 | 0.373165 | 0.009967 | 0.102687 | 0.016637 | 0.028415 | 0.022048 | 0.015537 | 0.155693 | 0.139041 | 0.179123 | -0.032180 | -0.079627 | 1.000000 | 0.306452 | 0.251225 | 0.292184 | -0.026978 | 0.176890 | 0.253097 | 0.064261 |
| AcceptedCmp5 | -7.081531e-03 | 0.005793 | 3.953115e-01 | -0.205618 | -0.191288 | 0.000119 | 0.472551 | 0.215697 | 0.373649 | 0.199421 | 0.259439 | 0.180854 | -0.183390 | 0.138470 | 0.322188 | 0.214568 | -0.278200 | 0.080219 | 0.306452 | 1.000000 | 0.403019 | 0.221503 | -0.008356 | 0.326532 | 0.470005 | -0.015381 |
| AcceptedCmp1 | -2.122259e-02 | -0.007387 | 3.249420e-01 | -0.172309 | -0.140288 | -0.019313 | 0.354024 | 0.194618 | 0.309631 | 0.260633 | 0.241674 | 0.166238 | -0.123356 | 0.154954 | 0.307972 | 0.183025 | -0.192573 | 0.094661 | 0.251225 | 0.403019 | 1.000000 | 0.175283 | -0.024913 | 0.293882 | 0.381434 | 0.008147 |
| AcceptedCmp2 | -1.489225e-02 | -0.007252 | 1.039002e-01 | -0.081704 | -0.015664 | -0.001787 | 0.205878 | -0.009852 | 0.042939 | 0.002491 | 0.009892 | 0.049911 | -0.037736 | 0.034086 | 0.099780 | 0.085090 | -0.007217 | 0.071981 | 0.292184 | 0.221503 | 0.175283 | 1.000000 | -0.011074 | 0.169249 | 0.135748 | 0.007622 |
| Complain | 3.162482e-02 | -0.004163 | -2.794846e-02 | 0.036366 | 0.007591 | 0.005380 | -0.036016 | -0.002933 | -0.020810 | -0.018887 | -0.020467 | -0.029969 | 0.003665 | -0.013249 | -0.018348 | -0.011539 | 0.020857 | 0.009916 | -0.026978 | -0.008356 | -0.024913 | -0.011074 | 1.000000 | 0.000185 | -0.033840 | 0.004450 |
| Response | -2.133020e-02 | 0.019520 | 1.606711e-01 | -0.079835 | -0.154730 | -0.198679 | 0.246920 | 0.125048 | 0.236050 | 0.111056 | 0.117070 | 0.139569 | 0.002145 | 0.148390 | 0.220526 | 0.038801 | -0.004061 | 0.254144 | 0.176890 | 0.326532 | 0.293882 | 0.169249 | 0.000185 | 1.000000 | 0.264962 | -0.018554 |
| Total_Purchases | -1.660364e-02 | -0.118099 | 7.895444e-01 | -0.556864 | -0.138899 | 0.020402 | 0.891724 | 0.614073 | 0.842860 | 0.642633 | 0.602752 | 0.523954 | -0.065397 | 0.519420 | 0.778427 | 0.674256 | -0.500722 | 0.053071 | 0.253097 | 0.470005 | 0.381434 | 0.135748 | -0.033840 | 0.264962 | 1.000000 | 0.113501 |
| Age | -2.652311e-03 | -0.992095 | 1.987700e-01 | -0.233855 | 0.363105 | 0.019397 | 0.162739 | 0.013618 | 0.030701 | 0.042341 | 0.019372 | 0.057253 | 0.068281 | 0.153874 | 0.125244 | 0.139230 | -0.117408 | -0.061105 | 0.064261 | -0.015381 | 0.008147 | 0.007622 | 0.004450 | -0.018554 | 0.113501 | 1.000000 |
Count number of columns ["Education"]
plt.figure(figsize=(12,7))
sns.countplot(x=df['Education'])
<AxesSubplot:xlabel='Education', ylabel='count'>
plt.figure(figsize=(12,7))
sns.countplot(x=df['Marital_Status'])
<AxesSubplot:xlabel='Marital_Status', ylabel='count'>
Distribution of (median_house_value")
plt.figure(figsize=(12,6))
sns.distplot(df['Total_Purchases'],color="blue")
plt.title("The graph shows the distribution of (Total_Purchases)")
Text(0.5, 1.0, 'The graph shows the distribution of (Total_Purchases)')
plt.figure(figsize=(10,5))
sns.distplot(df['Year_Birth'],color='red')
plt.title("The graph shows the distribution of (Year_Birth)")
Text(0.5, 1.0, 'The graph shows the distribution of (Year_Birth)')
The distribution of (mean_income)
plt.figure(figsize=(12,8))
sns.distplot(df['Income'],color='purple')
plt.title("The graph shows the distribution of (Income)")
Text(0.5, 1.0, 'The graph shows the distribution of (Income)')
The distribution of (median_income)
plt.figure(figsize=(12,8))
sns.distplot(df['Age'],color='green')
plt.title("The graph shows the distribution of (Age)")
Text(0.5, 1.0, 'The graph shows the distribution of (Age)')
df['Frequency'] = df[['NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases']].sum(axis=1)
df[['ID','Frequency']].head()
| ID | Frequency | |
|---|---|---|
| 0 | 5524 | 25 |
| 1 | 2174 | 6 |
| 2 | 4141 | 21 |
| 3 | 6182 | 8 |
| 4 | 5324 | 19 |
In order to perform proper customer segmentation, we will use the RMF technique, which will be explained further along the project. For that we need the customers' frequency, which is the number of purchases made in a given period of time.
We don't have that value explicitally, but we can calculate it but summing the number of purhases made by our customers through different channels (Deals, Web, etc) and assign it to df['Frequency'].
To perform our customer segmentation the technique to be applied is clustering with SKLearn K-Means.
It is a popular unsupervised machine learning algorithm used for cluster analysis. The algorithm groups similar data points into clusters based ontheir similarity, in order to make better and more accurate business decisions.
Before we proceed with the technique itself, we need to establish the following:
There are a few ways to determine an optimal number or clusters, but we will keep in mind the goal of our project: make better decisions based on customers behavior. In this case, we will define a number of cluster between two and four ( 'n_clusters': [i for i in range(2,5)] ) and let the algorithm tell us which is the optimal number. It wouldn't be viable to have 10, 20, clusters because we need to establish a limited number of groups in order to escalate our business actions.
That said, we will establish three customers features to create clusters for each of those features:
pd.set_option('display.max_columns', 29)
df.head()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | ... | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Total_Purchases | Age | Frequency | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-09-04 | 58 | 635 | 88 | 546 | 172 | 88 | ... | 8 | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1617 | 66 | 25 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-03-08 | 38 | 11 | 1 | 6 | 2 | 1 | ... | 1 | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 27 | 69 | 6 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426 | 49 | 127 | 111 | 21 | ... | 8 | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 776 | 58 | 21 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014-02-10 | 26 | 11 | 4 | 20 | 10 | 3 | ... | 2 | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 53 | 39 | 8 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173 | 43 | 118 | 46 | 27 | ... | 5 | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 422 | 42 | 19 |
5 rows × 30 columns
df['Frequency'] = df[['NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases']].sum(axis=1)
df[['ID','Frequency']].head()
| ID | Frequency | |
|---|---|---|
| 0 | 5524 | 25 |
| 1 | 2174 | 6 |
| 2 | 4141 | 21 |
| 3 | 6182 | 8 |
| 4 | 5324 | 19 |
To perform our customer segmentation the technique to be applied is clustering with SKLearn K-Means.
It is a popular unsupervised machine learning algorithm used for cluster analysis. The algorithm groups similar data points into clusters based on their similarity, in order to make better and more accurate business decisions.
Before we proceed with the technique itself, we need to establish the following:
There are a few ways to determine an optimal number or clusters, but we will keep in mind the goal of our project: make better decisions based on customers behavior. In this case, we will define a number of cluster between two and four ( 'n_clusters': [i for i in range(2,5)] ) and let the algorithm tell us which is the optimal number. It wouldn't be viable to have 10, 20, clusters because we need to establish a limited number of groups in order to escalate our business actions.
That said, we will establish three customers features to create clusters for each of those features:
Let's get to it
SEED = 1234
np.random.seed(SEED)
kmeans = KMeans()
parameters = {'n_clusters': [i for i in range(2,5)], 'init':['k-means++', 'random'],
'n_init':[10,20,30,40,50], 'max_iter':[300,500,700,100]}
random_search = RandomizedSearchCV(kmeans, parameters, random_state=SEED)
Fit Frequency
random_search.fit(df[['Frequency']])
print("Best hyperparameters: ", random_search.best_params_)
print("Best score: ", random_search.best_score_)
Best hyperparameters: {'n_init': 30, 'n_clusters': 4, 'max_iter': 500, 'init': 'random'}
Best score: -1989.5993461036724
Fit Total_Purchases
random_search.fit(df[['Total_Purchases']])
print("Best hyperparameters: ", random_search.best_params_)
print("Best score: ", random_search.best_score_)
Best hyperparameters: {'n_init': 30, 'n_clusters': 4, 'max_iter': 500, 'init': 'random'}
Best score: -9340371.11741151
Fit Recency
random_search.fit(df[['Recency']])
print("Best hyperparameters: ", random_search.best_params_)
print("Best score: ", random_search.best_score_)
Best hyperparameters: {'n_init': 30, 'n_clusters': 4, 'max_iter': 500, 'init': 'random'}
Best score: -23783.01205702912
kmeans1 = KMeans(n_init= 30, n_clusters = 4, max_iter = 500, init= 'random')
kmeans2 = KMeans(n_init= 30, n_clusters = 4, max_iter = 500, init= 'random')
kmeans3 = KMeans(n_init= 10, n_clusters = 4, max_iter = 300, init= 'random')
df['Cluster_Frequency'] = kmeans1.fit_predict(df[['Frequency']])
df['Cluster_Purchases'] = kmeans2.fit_predict(df[['Total_Purchases']])
df['Cluster_Recency'] = kmeans3.fit_predict(df[['Recency']])
df[['ID','Cluster_Recency', 'Cluster_Purchases', 'Cluster_Recency']].head()
| ID | Cluster_Recency | Cluster_Purchases | Cluster_Recency | |
|---|---|---|---|---|
| 0 | 5524 | 3 | 0 | 3 |
| 1 | 2174 | 1 | 3 | 1 |
| 2 | 4141 | 1 | 2 | 1 |
| 3 | 6182 | 1 | 3 | 1 |
| 4 | 5324 | 2 | 2 | 2 |
RMF, or Recency, Monetary, Frequency, is a technique used in customer relationship management and marketing to segment customers based on their purchasing behavior. The three components of the RMF technique are:
Now that we have the clusters of our customers assigned to each of these components, we will group them by the average value of these components, and assign a component score for each cluster, ranging from 0 to 3 . The higher the average of each cluster for Monetary and
Frequency , the higher the score, and the lower for Recency , the higher the score
Reset Index Frequency
frequency_group = df.groupby('Cluster_Frequency')['Frequency'].mean().sort_values(ascending=True).to_frame().reset_index()
frequency_group['index'] = frequency_group.index
frequency_group
| Cluster_Frequency | Frequency | index | |
|---|---|---|---|
| 0 | 3 | 6.612048 | 0 |
| 1 | 0 | 14.460076 | 1 |
| 2 | 1 | 20.518182 | 2 |
| 3 | 2 | 26.861027 | 3 |
Reset Index Recency
recency_group = df.groupby('Cluster_Recency')['Recency'].mean().sort_values(ascending=False).to_frame().reset_index()
recency_group['index'] = recency_group.index
recency_group
| Cluster_Recency | Recency | index | |
|---|---|---|---|
| 0 | 2 | 84.622699 | 0 |
| 1 | 3 | 56.898305 | 1 |
| 2 | 1 | 31.922780 | 2 |
| 3 | 0 | 9.601677 | 3 |
Reset Index Total_Purchases
purchases_group = df.groupby('Cluster_Purchases')['Total_Purchases'].mean().sort_values(ascending=True).to_frame().reset_index()
purchases_group['index'] = purchases_group.index
purchases_group
| Cluster_Purchases | Total_Purchases | index | |
|---|---|---|---|
| 0 | 3 | 100.580374 | 0 |
| 1 | 2 | 575.307159 | 1 |
| 2 | 1 | 1121.424837 | 2 |
| 3 | 0 | 1765.003636 | 3 |
DataFrame RFM Score
frequency_score = {2:0,0:1,1:2,3:3}
recency_score = {0:0, 2:1, 1:2, 3:3}
monetary_score = {2:0,1:1,3:2,0:3}
df['Frequency_Score'] = df['Cluster_Frequency'].map(frequency_score)
df['Recency_Score'] = df['Cluster_Recency'].map(recency_score)
df['Monetary_Score'] = df['Cluster_Purchases'].map(monetary_score)
df[['Frequency_Score','Recency_Score', 'Monetary_Score']].head()
| Frequency_Score | Recency_Score | Monetary_Score | |
|---|---|---|---|
| 0 | 0 | 3 | 3 |
| 1 | 3 | 2 | 2 |
| 2 | 2 | 2 | 0 |
| 3 | 3 | 2 | 2 |
| 4 | 2 | 1 | 0 |
df['RMF_Score'] = df[['Frequency_Score','Recency_Score', 'Monetary_Score']].sum(axis = 1)
df[['ID','RMF_Score']]
| ID | RMF_Score | |
|---|---|---|
| 0 | 5524 | 6 |
| 1 | 2174 | 7 |
| 2 | 4141 | 4 |
| 3 | 6182 | 7 |
| 4 | 5324 | 3 |
| ... | ... | ... |
| 2232 | 10870 | 6 |
| 2233 | 4001 | 5 |
| 2234 | 7270 | 4 |
| 2235 | 8235 | 2 |
| 2236 | 9405 | 5 |
2237 rows × 2 columns
Let assign to each costumer a segmenation, based on their scores
RMF Score = 0 -> Inactive
RMF Score = 1 or 2 -> Occasional
RMF Score = 3 or 4 -> Moderate
RMF Score = 5 or 5 -> Loyal
RMF Score = 7 , 8 , or 9 -> Premium
df['Segmentation'] = 'Inactive'
df.loc[df['RMF_Score']>=1, 'Segmentation'] = 'Occasional'
df.loc[df['RMF_Score']>=3, 'Segmentation'] = 'Moderate'
df.loc[df['RMF_Score']>=5, 'Segmentation'] = 'Loyal'
df.loc[df['RMF_Score']>=7, 'Segmentation'] = 'Premium'
sns.countplot(df['Segmentation'])
plt.title('Number of customers by Segmentation')
plt.show()
df['Segmentation'].value_counts().sort_values(ascending=False).to_frame()
| Segmentation | |
|---|---|
| Loyal | 871 |
| Moderate | 625 |
| Premium | 481 |
| Occasional | 239 |
| Inactive | 21 |
Above, you can see the number of clients by segmentation
Let's now plot a 3 d Sctterplot showing the clusters and its componet and the three components of our score
fig = px.scatter_3d(df, x='Frequency', y='Recency', z = 'Total_Purchases', color='Segmentation')
fig.show()
It is important to notice that, unlike other situations in which we use clustering techniques, the clusters were not as well defined like it may happen in a recommendation system, for example, but the intelligence is useful for business decision-making.
It is clear that:
Below, you can see four plots that show in detail what has been stated above
plt.figure(figsize=(20,7))
sns.relplot(data=df,x='Frequency',y='Total_Purchases',col='Recency_Score',hue="Segmentation")
<seaborn.axisgrid.FacetGrid at 0x16286412e50>
<Figure size 2000x700 with 0 Axes>
In the Exploratory Data Anlysis we have come to find out that meat and wine are our champions and are even more correlated to the customers income than Gold.
After the clustering and segmentation, does this pattern remain? Let's see below
df.groupby('Segmentation')[["MntWines","MntFruits",'MntMeatProducts','MntFishProducts',
'MntGoldProds']].sum().plot(kind="bar",figsize=(16,8))
plt.title("Total Amount Apent On Products By Segmentation")
plt.xticks(rotation=0)
plt.show()
df.groupby('Segmentation')[['MntWines','MntFruits','MntFruits',
'MntMeatProducts','MntFishProducts',
'MntGoldProds']].mean().plot(kind='bar',figsize=(16,8))
plt.title("Relative Amount Spent On Products By Segmentation")
plt.xticks(rotation=0)
plt.show()
Premium customer have that title for a reason, don't you agree?
It's worthy of note that there is a clear pattern, that varies according to other factor, some yet to be uncovered by further analysis, but it's clear that meat and wine have the most notable perfomance in the products mix.
Let's now observe the average income by segmentation.
fig=plt.figure()
df.groupby("Segmentation")['Income'].mean().plot(kind='bar',figsize=(12,8),color= ['#003049','#D62828','#F77F00', '#2274A5', '#1F7A8C' ])
plt.title("Average Income By Segmentation")
plt.xticks(rotation=0)
plt.xlabel("Segmentation",labelpad=8)
plt.show()
df_sum=df.groupby("Segmentation")['Total_Purchases'].sum()
df_mean=df.groupby("Segmentation")['Total_Purchases'].mean()
fig=plt.figure(figsize=(24,12))
ax0=fig.add_subplot(1,2,1)
ax1=fig.add_subplot(1,2,2)
df_sum.plot(kind='bar', color = ['#003049','#D62828','#F77F00', '#2274A5', '#1F7A8C' ], ax=ax0)
ax0.set_xticks([i for i in range(5)], df_sum.index, rotation=0, fontweight='bold', fontsize='large')
df_mean.plot(kind='bar', color = ['#006BA6','#0496FF','#FFBC42', '#D81159', '#2CA58D' ], ax=ax1)
ax1.set_title('Relative Amount in Purchases by Segmentation', fontweight='bold', fontsize='large')
ax1.set_xlabel('Segmentation', labelpad=8, fontweight='bold', fontsize='large')
ax1.set_xticks([i for i in range(5)], df_mean.index, rotation=0, fontweight='bold', fontsize='large')
plt.show()
df_campaigns=df.groupby("Segmentation")[['AcceptedCmp3',"AcceptedCmp4",'AcceptedCmp5','AcceptedCmp1',
'AcceptedCmp2']].sum().transpose().reset_index()
df_campaigns.index.rename('Index',inplace=True)
df_campaigns.rename(columns={"index":"Segmentation"},inplace=True)
fig=px.bar(df_campaigns,x='Segmentation',y=['Inactive','Loyal','Moderate','Occasional','Premium'],title="Campaign performance by Segmentation")
fig.show()
Now this is a topic that deserves thotough explanation. We can see that:
Average Purchase Value (APV) is the average amount spent by a customer on a single transaction( df['Total_Purchases']/df['Frequency'] ).
Let's create the column corresponding to this feature and the, group the customer bysegmentation and see the Average Purchase Value for each category
Calculate the Average Purchase Value (APV)
df['APV']=df['Total_Purchases']/df['Frequency']
df.groupby('Segmentation')['APV'].mean()
Segmentation Inactive 25.301070 Loyal NaN Moderate 39.873958 Occasional 37.470431 Premium NaN Name: APV, dtype: float64
Frequency assignment
df[df['Frequency']==0]
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | ... | Complain | Response | Total_Purchases | Age | Frequency | Cluster_Frequency | Cluster_Purchases | Cluster_Recency | Frequency_Score | Recency_Score | Monetary_Score | RMF_Score | Segmentation | APV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 653 | 5555 | 1975 | Graduation | Divorced | 153924.0 | 0 | 0 | 2014-02-07 | 81 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 6 | 48 | 0 | 3 | 3 | 2 | 3 | 1 | 2 | 6 | Loyal | inf |
| 979 | 3955 | 1965 | Graduation | Divorced | 4861.0 | 0 | 0 | 2014-06-22 | 20 | 2 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 6 | 58 | 0 | 3 | 3 | 0 | 3 | 0 | 2 | 5 | Loyal | inf |
| 1522 | 11110 | 1973 | Graduation | Single | 3502.0 | 1 | 0 | 2013-04-13 | 56 | 2 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 5 | 50 | 0 | 3 | 3 | 3 | 3 | 3 | 2 | 8 | Premium | inf |
| 2130 | 11181 | 1949 | PhD | Married | 156924.0 | 0 | 0 | 2013-08-29 | 85 | 2 | 1 | 2 | 1 | 1 | ... | 0 | 0 | 8 | 74 | 0 | 3 | 3 | 2 | 3 | 1 | 2 | 6 | Loyal | inf |
4 rows × 39 columns
Yes, it seems to be the case. As you can see above, these customers have a poor relationship with the company, hence, their rows can be dropped and yet, the integrity of our analysis will be kept
df.drop([653,979,1552,2130],inplace=True)
df[df['Frequency']==0]
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | ... | Complain | Response | Total_Purchases | Age | Frequency | Cluster_Frequency | Cluster_Purchases | Cluster_Recency | Frequency_Score | Recency_Score | Monetary_Score | RMF_Score | Segmentation | APV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1522 | 11110 | 1973 | Graduation | Single | 3502.0 | 1 | 0 | 2013-04-13 | 56 | 2 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 5 | 50 | 0 | 3 | 3 | 3 | 3 | 3 | 2 | 8 | Premium | inf |
1 rows × 39 columns
Table Segmentation and Average Purchase Value (APV)
df.groupby("Segmentation")['APV'].mean().to_frame()
| APV | |
|---|---|
| Segmentation | |
| Inactive | 25.301070 |
| Loyal | 31.696840 |
| Moderate | 39.879065 |
| Occasional | 37.470431 |
| Premium | NaN |
plt.figure(figsize=(12,8))
sns.scatterplot(data=df,x='APV',y='Income',hue='Segmentation')
plt.axhline(y=df['Income'].mean(),color='r',linestyle='--')
plt.axvline(x=df['APV'].mean(),color='r',linestyle='--')
plt.title('Income x APV')
plt.show()
df[df['APV']==df['APV'].max()]
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | ... | Complain | Response | Total_Purchases | Age | Frequency | Cluster_Frequency | Cluster_Purchases | Cluster_Recency | Frequency_Score | Recency_Score | Monetary_Score | RMF_Score | Segmentation | APV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1522 | 11110 | 1973 | Graduation | Single | 3502.0 | 1 | 0 | 2013-04-13 | 56 | 2 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 5 | 50 | 0 | 3 | 3 | 3 | 3 | 3 | 2 | 8 | Premium | inf |
1 rows × 39 columns